refactor: update references from SEQREF to SEQRES and aa_sequence #79

Open
St3451 wants to merge 1 commit into master from refactor/aa-sequence-column

St3451 (Collaborator) commented Feb 15, 2026

Summary

  • Replace the misleading refseq metadata column (previously used for amino-acid sequences) with a single consistent name: aa_sequence.
  • Keep refseq_prot unchanged (it remains the RefSeq protein accession).
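For samplesheets written before this change, the rename could be applied as a small migration step. The sketch below is an assumption about how such a step might look, not code from this PR: `rename_legacy_column` and the sample row are hypothetical placeholders, and only the column names `refseq`, `aa_sequence`, and `refseq_prot` come from the summary above.

```python
import csv
import io

LEGACY_COLUMN = "refseq"      # old, misleading name for the amino-acid sequence
NEW_COLUMN = "aa_sequence"    # name introduced by this PR


def rename_legacy_column(tsv_text: str) -> str:
    """Return the TSV text with the legacy `refseq` header renamed to `aa_sequence`.

    Only an exact header match is renamed, so `refseq_prot` is left alone.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return tsv_text
    header = rows[0]
    if NEW_COLUMN in header and LEGACY_COLUMN in header:
        raise ValueError("both 'refseq' and 'aa_sequence' are present; refusing to guess")
    rows[0] = [NEW_COLUMN if name == LEGACY_COLUMN else name for name in header]
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


# Placeholder sheet for illustration only (not real accession or sequence data).
legacy = "refseq_prot\trefseq\nNP_EXAMPLE.1\tMKTAYIAK\n"
migrated = rename_legacy_column(legacy)
```

Matching on the exact header name, rather than on a substring, is what keeps `refseq_prot` untouched.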

Still missing: a full test run, including:

  • update_samplesheet_and_structures.py
  • build-datasets
  • run

Copilot

This pull request makes several important changes to standardize the naming of sequence columns and improve SEQRES record handling throughout the dataset processing scripts. The most significant updates include renaming the refseq column to aa_sequence, updating related functions and documentation, and enhancing error handling for missing metadata. Below are the key changes grouped by theme:

Column Renaming and Data Consistency

  • Renamed the refseq column to aa_sequence throughout the codebase, including in DataFrame construction, metadata attachment, and FASTA file writing. This affects functions such as _parse_ncbi_mane_fasta, write_fastas_and_update_sheet, and attach_aa_sequence.
  • Updated function and variable names, as well as docstrings, to reflect the new aa_sequence naming convention.

SEQRES Record Handling

  • Standardized terminology and function names related to SEQRES records in PDB files, replacing REFSEQ and SEQREF with SEQRES in comments, log messages, and function names.
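For context on why SEQRES is the correct term: in the PDB format, SEQRES records list a chain's full residue sequence as three-letter codes, 13 residues per line. The following is a minimal sketch of that layout, assuming only the 20 canonical amino acids. `format_seqres_records` is a hypothetical helper written for illustration; it is not the `add_seqres_records_to_pdb` function touched by this PR, whose implementation is not shown here.

```python
# Standard one-letter to three-letter residue codes (20 canonical amino acids).
THREE_LETTER = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}

RESIDUES_PER_LINE = 13  # fixed by the PDB SEQRES record layout


def format_seqres_records(aa_sequence: str, chain_id: str = "A") -> list[str]:
    """Format a one-letter sequence as PDB SEQRES lines (hypothetical helper)."""
    # Raises KeyError on non-canonical residue codes.
    residues = [THREE_LETTER[aa] for aa in aa_sequence.upper()]
    total = len(residues)
    lines = []
    for start in range(0, total, RESIDUES_PER_LINE):
        serial = start // RESIDUES_PER_LINE + 1
        chunk = " ".join(residues[start:start + RESIDUES_PER_LINE])
        # Columns: 1-6 "SEQRES", 8-10 serial, 12 chain, 14-17 residue count,
        # then residue names starting at column 20.
        lines.append(f"SEQRES {serial:>3} {chain_id} {total:>4}  {chunk}")
    return lines


records = format_seqres_records("GIVEQCCTSICSLYQLENYCN")
# records[0] == "SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU"
# records[1] == "SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN"
```

The sketch does not pad lines to 80 columns and does not guard against the fixed-width serial and count fields overflowing on very long chains; a production writer would need to handle both.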

Metadata Validation and Logging

  • Added validation for required columns in custom MANE metadata files, raising a clear error if sequence or aa_sequence columns are missing.
  • Improved logging messages to clarify when SEQRES insertion is skipped due to missing or unavailable metadata.
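The two behaviours above might look roughly like the sketch below. It is an illustration under assumptions, not the code in `custom_pdb.py`: the function names, the choice of `ValueError`, and the log wording are hypothetical, while the required column names `sequence` and `aa_sequence` are taken from this description.

```python
import logging

logger = logging.getLogger(__name__)

REQUIRED_COLUMNS = ("sequence", "aa_sequence")  # names taken from the PR description


def check_required_columns(columns, required=REQUIRED_COLUMNS):
    """Raise ValueError naming every required column absent from `columns`."""
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError(
            "Custom MANE metadata is missing required column(s): "
            + ", ".join(missing)
        )


def should_insert_seqres(metadata_row):
    """Return True only if the row carries a non-empty amino-acid sequence."""
    if metadata_row is None:
        logger.warning("Skipping SEQRES insertion: no metadata available")
        return False
    if not metadata_row.get("aa_sequence"):
        logger.warning("Skipping SEQRES insertion: aa_sequence is missing or empty")
        return False
    return True
```

Collecting all missing names before raising means a user with a malformed metadata file sees every problem in one error rather than fixing columns one at a time.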

These changes collectively improve clarity, maintain consistency across scripts, and ensure robust handling of sequence and metadata information.

Copilot AI review requested due to automatic review settings February 15, 2026 04:07
Copilot AI left a comment

Pull request overview

This pull request refactors the codebase to use consistent and accurate naming for sequence-related data. The main changes standardize column names and terminology throughout the dataset processing scripts to avoid confusion between amino-acid sequences and RefSeq protein accessions.

Changes:

  • Renamed the refseq column to aa_sequence throughout the codebase to clearly identify amino-acid sequence data
  • Updated all function names, variable names, and docstrings to reflect the new aa_sequence naming convention
  • Standardized terminology from SEQREF/REFSEQ to SEQRES in comments and function names to match PDB file format specifications
  • Added validation for required columns in custom MANE metadata files
  • Fixed bugs in logging messages that referenced undefined variables

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tools/preprocessing/update_samplesheet_and_structures.py | Renamed attach_refseq function to attach_aa_sequence; updated internal variable names and column references from refseq to aa_sequence |
| tools/preprocessing/prepare_samplesheet.py | Updated _parse_ncbi_mane_fasta to create DataFrame with aa_sequence column; updated write_fastas_and_update_sheet to use aa_sequence column |
| scripts/datasets/custom_pdb.py | Added validation for required columns (sequence and aa_sequence); updated column access to use aa_sequence; improved logging messages by removing references to undefined variables; updated comments to use SEQRES terminology |
| scripts/datasets/af_merge.py | Renamed add_refseq_record_to_pdb function to add_seqres_records_to_pdb; updated comments and docstrings to use SEQRES terminology; fixed typo in comment |
